Abstract

This paper introduces the Partition Tree Weighting technique, an efficient meta-algorithm
for piecewise stationary sources. The technique works by performing Bayesian model averaging
over a large class of possible partitions of the data into locally stationary segments. It
uses a prior, closely related to the Context Tree Weighting technique of Willems, that is well
suited to data compression applications. Our technique can be applied to any coding distribution
at an additional time and space cost only logarithmic in the sequence length. We provide a
competitive analysis of the redundancy of our method, and explore its application in a variety
of settings. The order of the redundancy and the complexity of our algorithm match those
of the best competitors available in the literature, and the new algorithm exhibits a superior
complexity-performance trade-off in our experiments.

1 Introduction 

Coping with data generated from non-stationary sources is a fundamental problem in data
compression. Many real-world data sources drift or change suddenly, often violating the stationarity
assumptions implicit in many models. Rather than modifying such models so that they robustly
handle non-stationary data, one promising approach has been to instead design meta-algorithms
that automatically generalize existing stationary models to various kinds of non-stationary settings.

A particularly well-studied kind of non-stationary source is the class of piecewise stationary 
sources. Algorithms designed for this setting assume that the data generating source can be well 
modeled by a sequence of stationary sources. This assumption is quite reasonable, as piecewise 
stationary sources have been shown [1] to adequately handle various types of non-stationarity.
Piecewise stationary sources have received considerable attention from researchers in information 
theory [21, 20, 16], online learning [11, 19, 12, 6, 10, 9], time series [1, 5], and graphical models 
[7, 15, 2]. 

An influential approach for piecewise stationary sources is the universal transition diagram 
technique of Willems [21] for the statistical data compression setting. This technique performs 
Bayesian model averaging over all possible partitions of a sequence of data. Though powerful for 
modeling piecewise stationary sources, its quadratic time complexity makes it too computationally 
intensive for many applications. Since then, more efficient algorithms have been introduced that 
weight over restricted subclasses of partitions [20, 16, 10, 9]. For example, Live and Die Coding
[20] considers only $\log t$ partitions at any particular time $t$ by terminating selected partitions as
time progresses, resulting in an $O(n \log n)$ algorithm for binary, piecewise stationary, memoryless
sources with provable redundancy guarantees.¹ György et al. [9] recently extended and generalized
these approaches into a parametrized online learning framework that can interpolate between many
of the aforementioned weighting schemes for various loss functions. We also note that linear
complexity algorithms exist for prediction in changing environments [11, 12, 23], but these algorithms
work with a restricted class of predictors and are not applicable directly to the data compression
problem considered here.

In this paper we introduce the Partition Tree Weighting (PTW) technique, a computationally
efficient meta-algorithm that also works by weighting over a large subset of possible partitions of 
the data into stationary segments. Compared with previous work, the distinguishing feature of our 
approach is to use a prior, closely related to Context Tree Weighting [22], that contains a strong 
bias towards partitions containing long runs of stationary data. As we shall see later, this bias is 
particularly suited for data compression applications, while still allowing us to provide theoretical 
guarantees competitive with previous low complexity weighting approaches. 

2 Background 

We begin with some terminology for sequential, probabilistic data generating sources. An alphabet
is a finite, non-empty set of symbols, which we will denote by $\mathcal{X}$. A string
$x_1 x_2 \ldots x_n \in \mathcal{X}^n$ of length $n$ is denoted by $x_{1:n}$. The prefix $x_{1:j}$ of
$x_{1:n}$, $j \le n$, is denoted by $x_{\le j}$ or $x_{<j+1}$. The empty string is denoted by $\epsilon$.
Our notation also generalises to out of bounds indices; that is, given a string $x_{1:n}$ and an
integer $m > n$, we define $x_{1:m} := x_{1:n}$ and $x_{m:n} := \epsilon$. The concatenation of
two strings $s$ and $r$ is denoted by $sr$.

Probabilistic Data Generating Sources. A probabilistic data generating source $p$ is defined by a
sequence of probability mass functions $p_n : \mathcal{X}^n \to [0, 1]$, for all $n \in \mathbb{N}$,
satisfying the compatibility constraint that $p_n(x_{1:n}) = \sum_{y \in \mathcal{X}} p_{n+1}(x_{1:n} y)$
for all $x_{1:n} \in \mathcal{X}^n$, with base case $p_0(\epsilon) = 1$. From here onwards, whenever the
meaning is clear from the argument to $p$, the subscripts on $p$ will be dropped. Under this definition,
the conditional probability of a symbol $x_n$ given previous data $x_{<n}$ is defined as
$p(x_n \mid x_{<n}) := p(x_{1:n}) / p(x_{<n})$ provided $p(x_{<n}) > 0$, with the familiar chain rules
$p(x_{1:n}) = \prod_{i=1}^{n} p(x_i \mid x_{<i})$ and $p(x_{i:j} \mid x_{<i}) = \prod_{k=i}^{j} p(x_k \mid x_{<k})$
now following.

Temporal Partitions. Now we introduce some notation to describe temporal partitions. A segment
is a tuple $(a, b) \in \mathbb{N} \times \mathbb{N}$ with $a \le b$. A segment $(a, b)$ is said to overlap
with another segment $(c, d)$ if there exists an $i \in \mathbb{N}$ such that $a \le i \le b$ and
$c \le i \le d$. A temporal partition $\mathcal{P}$ of a set of time indices $S = \{1, 2, \ldots, n\}$, for
some $n \in \mathbb{N}$, is a set of non-overlapping segments such that for all $i \in S$, there exists a
segment $(a, b) \in \mathcal{P}$ such that $a \le i \le b$. We also use the overloaded notation
$\mathcal{P}(a, b) := \{(c, d) \in \mathcal{P} : a \le c \le d \le b\}$. Finally, $\mathcal{T}_n$ will be used
to denote the set of all possible temporal partitions of $\{1, 2, \ldots, n\}$.

Piecewise Stationary Sources. We can now define a piecewise stationary data generating source
$\mu$, in terms of a partition $\mathcal{P} = \{(a_1, b_1), (a_2, b_2), \ldots\}$ and a set of probabilistic
data generating sources $\{\mu^1, \mu^2, \ldots\}$, such that for all $n \in \mathbb{N}$, for all
$x_{1:n} \in \mathcal{X}^n$,

$$\mu(x_{1:n}) := \prod_{(a,b) \in \mathcal{P}_n} \mu^{f(a)}(x_{a:b}),$$

where $\mathcal{P}_n := \{(a_i, b_i) \in \mathcal{P} : a_i \le n\}$ and $f(i)$ returns the index of the time
segment containing $i$, that is, it gives a value $k \in \mathbb{N}$ such that both
$(a_k, b_k) \in \mathcal{P}$ and $a_k \le i \le b_k$.

¹ All logarithms in this paper are of base 2.

Figure 1: The set $\mathcal{C}_2$ represented as a collection of partition trees.

Redundancy. The ideal code length given by a probabilistic model (or probability assignment)
$p$ on a data sequence $x_{1:n} \in \mathcal{X}^n$ is given by $-\log p(x_{1:n})$, and the redundancy of
$p$, with respect to a probabilistic data generating source $\mu$, is defined as
$\log \mu(x_{1:n}) - \log p(x_{1:n})$. This quantity corresponds to the number of extra bits we would
need to transmit $x_{1:n}$ using an optimal code designed for $p$ (assuming the ideal code length)
compared to using an optimal code designed for the data generating source $\mu$.

3 Partition Tree Weighting 

Almost all existing prediction algorithms designed for changing source statistics are based on
the transition diagram technique of Willems [21]. This technique performs exact Bayesian model
averaging over the set of temporal partitions, or more precisely, the method averages over all
coding distributions formed by employing a particular base model $p$ on all segments of every possible
partition. Averaging over all temporal partitions (also known as transition paths) results in an
algorithm of $O(n^2)$ complexity. Several reduced complexity methods were proposed in the literature
that average over a significantly smaller set of temporal partitions [20, 10, 9]: the reduced number
of partitions allows for the computational complexity to be pushed down to $O(n \log n)$, while still
being sufficiently rich to guarantee almost optimal redundancy behavior (typically $O(\log n)$ times
larger than the optimum, for lossless data compression). In this paper we propose another member
of this family of methods. Our reduced set of temporal partitions, as well as the corresponding
mixture weights, are obtained from the Context Tree Weighting (CTW) algorithm [22], which results
in similar theoretical guarantees as the other methods, but shows superior performance in all of
our experiments. The method, called Partition Tree Weighting, heavily utilizes the computational
advantages offered by the CTW algorithm.

We now derive the Partition Tree Weighting (PTW) technique. As PTW is a meta-algorithm, it
takes as input a base model which we denote by $p$ from here onwards. This base model determines
what kind of data generating sources can be processed.

3.1 Model Class 

We begin by defining the class of binary temporal partitions. Although more restrictive than
the class of all possible temporal partitions, binary temporal partitions possess important
computational advantages that we will later exploit.

Definition 1. Given a depth parameter $d \in \mathbb{N}$ and a time $t \in \mathbb{N}$, the set
$\mathcal{C}_d(t)$ of all binary temporal partitions from $t$ is recursively defined by

$$\mathcal{C}_d(t) := \{\{(t, t + 2^d - 1)\}\} \cup \{ S_1 \cup S_2 : S_1 \in \mathcal{C}_{d-1}(t),\ S_2 \in \mathcal{C}_{d-1}(t + 2^{d-1}) \},$$

with $\mathcal{C}_0(t) := \{\{(t, t)\}\}$. Furthermore, we define $\mathcal{C}_d := \mathcal{C}_d(1)$.

Figure 2: Partitions updated at t = 2 (left) and t = 3 (right) in a depth-2 partition tree.
[Both panels show the depth-2 tree with root (1,4), children (1,2) and (3,4), and leaves
(1,1), (2,2), (3,3), (4,4); left edges are labelled 0 and right edges 1.]

For example, $\mathcal{C}_2 = \{ \{(1,4)\}, \{(1,2), (3,4)\}, \{(1,1), (2,2), (3,4)\}, \{(1,2), (3,3), (4,4)\},
\{(1,1), (2,2), (3,3), (4,4)\} \}$. Each partition can be naturally mapped onto a tree structure which
we will call a partition tree. Figure 1 shows the collection of partition trees represented by $\mathcal{C}_2$.
Notice that the number of binary temporal partitions $|\mathcal{C}_d|$ grows roughly double exponentially in $d$.
For example, $|\mathcal{C}_0| = 1$, $|\mathcal{C}_1| = 2$, $|\mathcal{C}_2| = 5$, $|\mathcal{C}_3| = 26$,
$|\mathcal{C}_4| = 677$, $|\mathcal{C}_5| = 458330$, which means that
some ingenuity will be required to weight over all of $\mathcal{C}_d$ efficiently.
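The recursion in Definition 1 can be checked by direct enumeration for small depths. The following is a minimal sketch, not part of the original method; the function name `binary_partitions` and the representation of a partition as a `frozenset` of `(start, end)` tuples are choices made here for illustration.

```python
from itertools import product


def binary_partitions(d, t=1):
    """Return C_d(t) as a set of frozensets of (start, end) segments."""
    whole = frozenset({(t, t + 2 ** d - 1)})
    if d == 0:
        return {whole}
    half = 2 ** (d - 1)
    left = binary_partitions(d - 1, t)          # partitions of the first half
    right = binary_partitions(d - 1, t + half)  # partitions of the second half
    # Either keep the whole span as one segment, or combine one partition
    # of each half.
    return {whole} | {s1 | s2 for s1, s2 in product(left, right)}


if __name__ == "__main__":
    for depth in range(5):
        print(depth, len(binary_partitions(depth)))
```

The counts obey $|\mathcal{C}_d| = |\mathcal{C}_{d-1}|^2 + 1$, which is why explicit enumeration stops being practical after a handful of levels.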

3.2 Coding Distribution 

We now consider a particular weighting over $\mathcal{C}_d$ that has both a bias towards simple partitions
and efficient computational properties. Given a data sequence $x_{1:n}$, we define

$$\mathrm{PTW}_d(x_{1:n}) := \sum_{\mathcal{P} \in \mathcal{C}_d} 2^{-\Gamma_d(\mathcal{P})} \prod_{(a,b) \in \mathcal{P}} p(x_{a:b}), \tag{1}$$

where $\Gamma_d(\mathcal{P})$ gives the number of nodes in the partition tree associated with $\mathcal{P}$
that have a depth less than $d$. This prior weighting is identical to how the Context Tree Weighting
method [22] weights over tree structures, and is an application of the general technique used by the
class of Tree Experts described in Section 5.3 of [4]. It is a valid prior, as one can show
$\sum_{\mathcal{P} \in \mathcal{C}_d} 2^{-\Gamma_d(\mathcal{P})} = 1$ for all $d \in \mathbb{N}$.
Note that Algorithm 1 is a special case, using a prior $2^{-\Gamma_d(\mathcal{P})}$ for
$\mathcal{P} \in \mathcal{C}_d$ and 0 otherwise, of the class of general algorithms discussed in [9].
As such, the main contribution of this paper is the introduction of this specific prior. A direct
computation of Equation 1 is clearly intractable. Instead, an efficient approach can be obtained by
noting that Equation 1 can be recursively decomposed.

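Lemma 1. For any depth $d \in \mathbb{N}$, given a sequence of data $x_{1:n} \in \mathcal{X}^n$ satisfying $n \le 2^d$,

$$\mathrm{PTW}_d(x_{1:n}) = \tfrac{1}{2}\, p(x_{1:n}) + \tfrac{1}{2}\, \mathrm{PTW}_{d-1}(x_{1:k})\, \mathrm{PTW}_{d-1}(x_{k+1:n}), \tag{2}$$

where $k = 2^{d-1}$.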

Proof. A straightforward adaptation of [22, Lemma 2], see Appendix A. □ 
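As a numerical sanity check of Lemma 1, the explicit sum of Equation 1 can be compared with the recursion of Equation 2 on a short binary sequence. This is only a sketch under stated assumptions: the base model is taken to be a KT estimator, and the helper names (`kt`, `weighted_partitions`, `ptw_direct`, `ptw_recursive`) are hypothetical. The node count of a combined partition is accumulated as one plus the counts of its two halves, and a single-segment partition contributes one node when $d > 0$ and none when $d = 0$.

```python
import math
from itertools import product


def kt(bits):
    """KT probability of a binary sequence (1.0 for the empty sequence)."""
    prob, counts = 1.0, [0, 0]
    for x in bits:
        prob *= (counts[x] + 0.5) / (counts[0] + counts[1] + 1)
        counts[x] += 1
    return prob


def weighted_partitions(d, t=1):
    """Yield (Gamma_d(P), P) for every P in C_d(t); P is a list of segments."""
    yield (1 if d > 0 else 0), [(t, t + 2 ** d - 1)]
    if d > 0:
        half = 2 ** (d - 1)
        lefts = list(weighted_partitions(d - 1, t))
        rights = list(weighted_partitions(d - 1, t + half))
        for (g1, p1), (g2, p2) in product(lefts, rights):
            yield g1 + g2 + 1, p1 + p2


def ptw_direct(bits, d, base=kt):
    """Equation 1: explicit sum over every partition in C_d."""
    total = 0.0
    for gamma, partition in weighted_partitions(d):
        prob = 1.0
        for a, b in partition:
            # Segments reaching past the data are truncated by slicing,
            # matching the out-of-bounds convention for x_{a:b}.
            prob *= base(bits[a - 1:b])
        total += 2.0 ** -gamma * prob
    return total


def ptw_recursive(bits, d, base=kt):
    """Equation 2: split at k = 2^(d-1) and recurse on both halves."""
    if d == 0 or not bits:
        return base(bits)
    k = 2 ** (d - 1)
    return (0.5 * base(bits)
            + 0.5 * ptw_recursive(bits[:k], d - 1, base)
            * ptw_recursive(bits[k:], d - 1, base))


if __name__ == "__main__":
    data = [0, 1, 1, 0, 1, 1, 1]
    print(-math.log2(ptw_direct(data, 3)), -math.log2(ptw_recursive(data, 3)))
```

The direct sum touches every partition and is only feasible for very small $d$, whereas the recursion visits each tree node once.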

3.3 Algorithm 

Lemma 1 allows us to compute $\mathrm{PTW}_d(x_{1:n})$ in a bottom up fashion. This leads to an algorithm
that runs in $O(nd)$ time and space by maintaining a context tree data structure in memory. One of
our main contributions is to further reduce the space overhead of PTW to $O(d)$ by exploiting the
regular access pattern to this data structure. To give some intuition, Figure 2 shows in bold the
nodes in a context tree data structure that would need to be updated at times $t = 2$ and $t = 3$. The
key observation is that because our access patterns are performing a kind of depth first traversal of
the context tree, the needed statistics can be summarized in a stack of size $d$. This has important
practical significance, as the performance of many interesting base models will depend on how
much memory is available.

Algorithm 1 Partition Tree Weighting - $\mathrm{PTW}_d(x_{1:n})$
Require: A depth parameter $d \in \mathbb{N}$
Require: A data sequence $x_{1:n} \in \mathcal{X}^n$ satisfying $n \le 2^d$
Require: A base probabilistic model $p$

    $b_j \leftarrow 1$, $w_j \leftarrow 1$, $r_j \leftarrow 1$, for $0 \le j \le d$
    for $t = 1$ to $n$ do
        for $j = i + 1$ to $d$ do
        end for
        for $i = d - 1$ to $0$ do
        end for
    end for
    return $w_0$

Algorithm 1 describes our $O(nd)$ time and $O(d)$ space technique for computing $\mathrm{PTW}_d(x_{1:n})$. It
uses a routine, $\mathrm{MSCB}_d(t)$, that returns the most significant changed context bit; that is, for $t > 1$,
this is the number of bits to the left of the most significant location at which the $d$-bit binary
representations of $t - 1$ and $t - 2$ differ, with $\mathrm{MSCB}_d(1) := 0$ for all $d \in \mathbb{N}$. For example,
for $d = 5$, we have $\mathrm{MSCB}_5(4) = 4$ and $\mathrm{MSCB}_5(7) = 3$. Since $d = \lceil \log n \rceil$, the algorithm
effectively runs in $O(n \log n)$ time and $O(\log n)$ space. Furthermore, Algorithm 1 can be modified
to run incrementally, as $\mathrm{PTW}_d(x_{1:n})$ can be computed from $\mathrm{PTW}_d(x_{<n})$ in $O(d)$ time provided the
intermediate buffers $b_i, w_i, r_i$ for $0 \le i \le d$ are kept in memory.
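One way the stack-based computation described above could look is sketched below. This is a reconstruction derived from Lemma 1 and the description of $\mathrm{MSCB}_d$, not a transcription of the published listing; the buffer names `b`, `w`, `r` follow the text, while the `KT` class, the `make_model` factory, the assumption $d \ge 1$, and the use of one base-model instance per depth are choices made here. A recursive reference implementation is included only so the two can be compared.

```python
import math


class KT:
    """Incremental KT estimator for binary symbols (assumed base model)."""

    def __init__(self):
        self.counts = [0, 0]

    def update(self, x):
        """Return P(x | symbols seen so far), then record x."""
        p = (self.counts[x] + 0.5) / (sum(self.counts) + 1)
        self.counts[x] += 1
        return p


def mscb(t, d):
    """Bits to the left of the most significant position where the d-bit
    representations of t - 1 and t - 2 differ; defined as 0 for t = 1."""
    if t == 1:
        return 0
    return d - ((t - 1) ^ (t - 2)).bit_length()


def ptw_streaming(bits, d, make_model=KT):
    """Compute PTW_d(x_{1:n}) with O(d) buffers (hypothetical sketch)."""
    assert d >= 1 and len(bits) <= 2 ** d
    b = [1.0] * (d + 1)  # b[j]: PTW of the finished left child below depth j
    w = [1.0] * (d + 1)  # w[j]: PTW of the data seen by the depth-j node
    r = [1.0] * (d + 1)  # r[j]: base-model probability of that same data
    models = [make_model() for _ in range(d + 1)]
    for t, x in enumerate(bits, start=1):
        i = mscb(t, d)
        b[i] = w[i + 1]                 # left subtree is complete (no-op at t = 1)
        for j in range(i + 1, d + 1):   # nodes below depth i start afresh
            b[j], w[j], r[j] = 1.0, 1.0, 1.0
            models[j] = make_model()
        for j in range(d, -1, -1):      # leaf first, then up to the root
            r[j] *= models[j].update(x)
            w[j] = r[j] if j == d else 0.5 * r[j] + 0.5 * w[j + 1] * b[j]
    return w[0]


def ptw_recursive(bits, d, make_model=KT):
    """Reference implementation of the Lemma 1 recursion."""
    model = make_model()
    base = math.prod(model.update(x) for x in bits)
    if d == 0 or not bits:
        return base
    k = 2 ** (d - 1)
    return (0.5 * base
            + 0.5 * ptw_recursive(bits[:k], d - 1, make_model)
            * ptw_recursive(bits[k:], d - 1, make_model))
```

After each symbol, `w[0]` holds the probability of the prefix seen so far, so the per-symbol cost is one pass over the $d + 1$ buffers.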

3.4 Theoretical Properties 

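We now provide a theoretical analysis of the Partition Tree Weighting method. Our first result
shows that using PTW with a base model $p$ is almost as good as using $p$ with any partition in the
class $\mathcal{C}_d$.

Proposition 1. For all $n \in \mathbb{N}$, where $d = \lceil \log n \rceil$, for all $x_{1:n} \in \mathcal{X}^n$,
for all $\mathcal{P} \in \mathcal{C}_d$, we have

$$-\log \mathrm{PTW}_d(x_{1:n}) \le \Gamma_d(\mathcal{P}) + \sum_{(a,b) \in \mathcal{P}} -\log p(x_{a:b}).$$

Proof. We have that

$$-\log \mathrm{PTW}_d(x_{1:n}) = -\log \Bigg( \sum_{\mathcal{P} \in \mathcal{C}_d} 2^{-\Gamma_d(\mathcal{P})} \prod_{(a,b) \in \mathcal{P}} p(x_{a:b}) \Bigg) \le \Gamma_d(\mathcal{P}') - \log \prod_{(a,b) \in \mathcal{P}'} p(x_{a:b}),$$

where $\mathcal{P}'$ is an arbitrary partition in $\mathcal{C}_d$. Rewriting the last term completes the proof. □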

The next result shows that there always exists a binary temporal partition (i.e., a partition in $\mathcal{C}_d$)
which is in some sense close to any particular temporal partition. To make this more precise, we
introduce some more terminology. First we define $C(\mathcal{P}) := \{a\}_{(a,b) \in \mathcal{P}} \setminus \{1\}$,
which is the set of time indices where an existing segment ends and a new segment begins in partition
$\mathcal{P}$. Now, if $C(\mathcal{P}) \subseteq C(\mathcal{P}')$, we say $\mathcal{P}'$ is a refinement of
partition $\mathcal{P}$. In other words, $\mathcal{P}'$ is a refinement of $\mathcal{P}$ if $\mathcal{P}'$
always starts a new segment whenever $\mathcal{P}$ does. With a slight abuse of notation, we will also
use $C(\mathcal{P})$ to denote the partition $\mathcal{P}$.

Lemma 2. For all $n \in \mathbb{N}$, for any temporal partition $\mathcal{P} \in \mathcal{T}_n$, with
$d = \lceil \log n \rceil$, there exists a binary temporal partition $\mathcal{P}' \in \mathcal{C}_d$ such that
$\mathcal{P}'$ is a refinement of $\mathcal{P}$ and $|\mathcal{P}'| \le |\mathcal{P}| (\lceil \log n \rceil + 1)$.

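Proof. We prove this via construction. Consider a binary tree $T_i$ with $1 \le i \le n$ formed from the
following recursive procedure: 1. Set $a = 1$, $b = 2^d$, add the node $(a, b)$ to the tree. 2. If $a = i$ then
stop; otherwise, add $(a, a + \lfloor \tfrac{b-a}{2} \rfloor)$, $(a + \lfloor \tfrac{b-a}{2} \rfloor + 1, b)$ as
children to node $(a, b)$, and then set $(a, b)$ to the newly added child containing $i$ and goto step 2.

Next define $\mathcal{L}(T_i)$ and $\mathcal{I}(T_i)$ to be the set of leaf and internal nodes of $T_i$ respectively.
Notice that $|\mathcal{L}(T_i)| \le d + 1$ and that $\mathcal{L}(T_i) \in \mathcal{C}_d$. Now, consider the set

$$\mathcal{P}' := \Big\{ (a, b) \in \bigcup_{i \in C(\mathcal{P})} \mathcal{L}(T_i) : (a, b) \notin \bigcup_{i \in C(\mathcal{P})} \mathcal{I}(T_i) \Big\}.$$

It is easy to verify that $\mathcal{P}' \in \mathcal{C}_d$ and that $C(\mathcal{P}) \subseteq C(\mathcal{P}')$.
The proof is concluded by noticing that

$$|\mathcal{P}'| \le \Big| \bigcup_{i \in C(\mathcal{P})} \mathcal{L}(T_i) \Big| \le \sum_{i \in C(\mathcal{P})} |\mathcal{L}(T_i)| \le |\mathcal{P}| (d + 1) = |\mathcal{P}| (\lceil \log n \rceil + 1). \qquad \square$$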
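The construction in this proof is short enough to run by hand or in code. The sketch below is illustrative only: `path_tree` and `binary_refinement` are hypothetical names, and the code iterates over every segment start including 1 (whose path tree is just the root), an assumption made here so that a single-segment partition maps to the whole span rather than to an empty set.

```python
def path_tree(i, d):
    """Leaves and internal nodes of T_i for a time index 1 <= i <= 2^d."""
    a, b = 1, 2 ** d
    leaves, internal = set(), set()
    while a != i:
        mid = a + (b - a) // 2
        left, right = (a, mid), (mid + 1, b)
        internal.add((a, b))
        if i <= mid:              # descend into the child containing i
            leaves.add(right)
            a, b = left
        else:
            leaves.add(left)
            a, b = right
    leaves.add((a, b))            # the node that starts exactly at i
    return leaves, internal


def binary_refinement(partition, d):
    """Sorted segments of P' built from the segment starts of `partition`."""
    leaves, internal = set(), set()
    for start, _ in partition:
        tree_leaves, tree_internal = path_tree(start, d)
        leaves |= tree_leaves
        internal |= tree_internal
    # Keep only leaves that no other path tree has split further.
    return sorted(leaves - internal)


if __name__ == "__main__":
    # n = 10, so d = ceil(log2(10)) = 4 and the tree spans indices 1..16.
    print(binary_refinement([(1, 5), (6, 9), (10, 10)], 4))
```

For the partition {(1,5), (6,9), (10,10)} this produces eight segments, well within the bound of $3 \cdot (4 + 1) = 15$ given by Lemma 2.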



Next we show that if we have a redundancy bound for the base model $p$ that holds for any finite
sequence of data generated by some class of bounded memory data generating sources, we can
automatically derive a redundancy bound when using PTW on the piecewise stationary extension
of that same class.

Theorem 1. For all $n \in \mathbb{N}$, using PTW with $d = \lceil \log n \rceil$ and a base model $p$ whose
redundancy is upper bounded by a non-negative, monotonically non-decreasing, concave function
$g : \mathbb{N} \to \mathbb{R}$ with $g(0) = 0$ on some class $\mathcal{G}$ of bounded memory data generating
sources, the redundancy

$$\log \mu(x_{1:n}) - \log \mathrm{PTW}_d(x_{1:n}) \le \Gamma_d(\mathcal{P}') + |\mathcal{P}|\, g\!\left( \left\lceil \frac{n}{|\mathcal{P}| (\lceil \log n \rceil + 1)} \right\rceil \right) (\lceil \log n \rceil + 1),$$

where $\mu$ is a piecewise stationary data generating source, and the data in each of the stationary
regions $\mathcal{P} \in \mathcal{T}_n$ is distributed according to some source in $\mathcal{G}$. Furthermore,
$\Gamma_d(\mathcal{P}')$ can be upper bounded independently of $d$ by $2 |\mathcal{P}| (\lceil \log n \rceil + 1)$.

Proof. From Lemma 2, we know that there exists a partition $\mathcal{P}' \in \mathcal{C}_d$ that is a refinement of
$\mathcal{P}$ containing at most $|\mathcal{P}| (\lceil \log n \rceil + 1)$ segments. Applying Proposition 1 to
$\mathcal{P}'$ and using our redundancy bound $g$ gives

$$\begin{aligned}
\log \mu(x_{1:n}) - \log \mathrm{PTW}_d(x_{1:n})
&\le \log \mu(x_{1:n}) + \Gamma_d(\mathcal{P}') - \sum_{(a,b) \in \mathcal{P}'} \log p(x_{a:b}) \\
&= \Gamma_d(\mathcal{P}') + \sum_{(a,b) \in \mathcal{P}} \log \mu^{f(a)}(x_{a:b}) - \sum_{(a,b) \in \mathcal{P}'} \log p(x_{a:b}) \\
&= \Gamma_d(\mathcal{P}') - \sum_{(a,b) \in \mathcal{P}'} \log p(x_{a:b}) + \sum_{(a,b) \in \mathcal{P}} \sum_{(c,d) \in \mathcal{P}'(a,b)} \log \mu^{f(a)}(x_{c:d} \mid x_{a:c-1}) \\
&\le \Gamma_d(\mathcal{P}') + \sum_{(a,b) \in \mathcal{P}'} g(b - a + 1).
\end{aligned}$$

Since by definition $g$ is concave, Jensen's inequality implies
$\sum_{(a,b) \in \mathcal{P}'} g(b - a + 1) \le |\mathcal{P}'|\, g(n / |\mathcal{P}'|)$.
Furthermore, the conditions on $g$ also imply that $a\, g(b/a)$ is a nondecreasing function of $a > 0$ for
any fixed $b > 0$, and so Lemma 2 implies

$$\sum_{(a,b) \in \mathcal{P}'} g(b - a + 1) \le |\mathcal{P}|\, g\!\left( \left\lceil \frac{n}{|\mathcal{P}| (\lceil \log n \rceil + 1)} \right\rceil \right) (\lceil \log n \rceil + 1),$$

which completes the main part of the proof. The term $\Gamma_d(\mathcal{P}')$ is proportional to the size of the
partition tree needed to capture the locations where the data source changes. Since at least half of
the nodes of a binary tree are leaves, and the number of leaf nodes is the same as the number of
segments in a partition, for any $d$ we have
$\Gamma_d(\mathcal{P}') \le 2 |\mathcal{P}'| \le 2 |\mathcal{P}| (\lceil \log n \rceil + 1)$. □

We also remark that Theorem 1 can be obtained by combining Lemma 1 in [9] with our Lemma 
2 and the properties of g, as per the proof of Theorem 2 in [9]. However, to be self-contained, we 
decided to include the above short proof of the theorem. 

Removing the dependence on n and d. Our previous results required choosing a depth $d$ in advance
such that $d = \lceil \log n \rceil$. This restriction can be lifted by using the modified coding distribution
given by $\mathrm{PTW}(x_{1:n}) := \prod_{i=1}^{n} \mathrm{PTW}_{\lceil \log i \rceil}(x_i \mid x_{<i})$. The next result
justifies this choice.

Theorem 2. For all $n \in \mathbb{N}$, for all $x_{1:n} \in \mathcal{X}^n$, we have that

$$-\log \mathrm{PTW}(x_{1:n}) \le -\log \mathrm{PTW}_d(x_{1:n}) + \lceil \log n \rceil (\log 3 - 1),$$

where $d = \lceil \log n \rceil$.

Thus, the overhead due to not knowing $n$ in advance is $O(\log n)$. The proof of this result, given
in Appendix A, is based on the fact that for any $t, k \in \mathbb{N}$ satisfying $1 \le t \le 2^k$, we have

$$\tfrac{2}{3}\, \mathrm{PTW}_{k+1}(x_{1:t}) \le \mathrm{PTW}_k(x_{1:t}), \tag{3}$$

which implies that each time the depth of the tree is increased, the algorithm suffers at most an
extra $\log(3/2)$ penalty.
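Both Equation 3 and the bound of Theorem 2 can be checked numerically on short sequences. The sketch below assumes a KT base model and evaluates each $\mathrm{PTW}_d$ with the Lemma 1 recursion; the helper names (`kt`, `ptw`, `ceil_log2`, `ptw_unknown_length`, `overhead_bits`) are hypothetical, and the quadratic-time evaluation is for illustration rather than efficiency.

```python
import math


def kt(bits):
    """KT probability of a binary sequence (1.0 for the empty sequence)."""
    prob, counts = 1.0, [0, 0]
    for x in bits:
        prob *= (counts[x] + 0.5) / (counts[0] + counts[1] + 1)
        counts[x] += 1
    return prob


def ptw(bits, d):
    """PTW_d via the Lemma 1 recursion with a KT base model."""
    if d == 0 or not bits:
        return kt(bits)
    k = 2 ** (d - 1)
    return 0.5 * kt(bits) + 0.5 * ptw(bits[:k], d - 1) * ptw(bits[k:], d - 1)


def ceil_log2(i):
    """ceil(log2(i)) for a positive integer i."""
    return (i - 1).bit_length()


def ptw_unknown_length(bits):
    """Product over i of PTW_{ceil(log i)}(x_i | x_{<i})."""
    prob = 1.0
    for i in range(1, len(bits) + 1):
        d = ceil_log2(i)
        # Conditional probability as a ratio of two joint probabilities.
        prob *= ptw(bits[:i], d) / ptw(bits[:i - 1], d)
    return prob


def overhead_bits(bits):
    """Extra code length of the depth-free variant over PTW_d, d = ceil(log n)."""
    d = ceil_log2(len(bits))
    return -math.log2(ptw_unknown_length(bits)) + math.log2(ptw(bits, d))


if __name__ == "__main__":
    data = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
    limit = ceil_log2(len(data)) * (math.log2(3) - 1)
    print(overhead_bits(data), "<=", limit)
```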

Algorithm 1 can be straightforwardly modified to compute $\mathrm{PTW}(x_{1:n})$ using the same amount
of resources as needed for $\mathrm{PTW}_d(x_{1:n})$. The main idea is to increase the size of the stack by one
and copy over the relevant statistics whenever a power of two boundary is crossed. Alternatively,
one could simply pick a sufficiently large value of $d$ in advance. This is also justified, as, based on
Equation 3, the penalty for using an unnecessarily large $k > d$ can be bounded by

$$-\log \mathrm{PTW}_k(x_{1:n}) \le -\log \mathrm{PTW}_d(x_{1:n}) + (k - d) \log \tfrac{3}{2}.$$
4 Applications

We now explore the performance of Partition Tree Weighting in a variety of settings. 

Binary, Memoryless, Piecewise Stationary Sources. First we investigate using the well known
KT estimator [13] as a base model for PTW. We begin with a brief overview of the KT estimator.
Consider a sequence $x_{1:n} \in \{0, 1\}^n$ generated by successive Bernoulli trials. If $a$ and $b$ denote the
number of zeroes and ones in $x_{1:n}$ respectively, and $\theta \in [0, 1]$ denotes the probability of observing
a 1 on any given trial, then $\Pr(x_{1:n} \mid \theta) = \theta^b (1 - \theta)^a$. One way to construct a distribution over
$x_{1:n}$, in the case where $\theta$ is unknown, is to weight over the possible values of $\theta$. The KT-estimator
uses the weighting $w(\theta) := \mathrm{Beta}(\tfrac{1}{2}, \tfrac{1}{2}) = \pi^{-1} \theta^{-1/2} (1 - \theta)^{-1/2}$, which gives
the coding distribution $\mathrm{KT}(x_{1:n}) := \int_0^1 \theta^b (1 - \theta)^a w(\theta)\, d\theta$. This quantity can be
efficiently computed online by maintaining the $a$ and $b$ counts incrementally and using the chain rule,
that is, $\Pr(x_{n+1} = 1 \mid x_{1:n}) = 1 - \Pr(x_{n+1} = 0 \mid x_{1:n}) = (b + 1/2)/(n + 1)$. Furthermore, the
parameter redundancy can be bounded uniformly; restating a result from [22], one can show that for all
$n \in \mathbb{N}$, for all $x_{1:n} \in \mathcal{X}^n$, for all $\theta \in [0, 1]$,

$$\log \theta^b (1 - \theta)^a - \log \mathrm{KT}(x_{1:n}) \le \tfrac{1}{2} \log(n) + 1. \tag{4}$$
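The chain-rule form and the mixture integral describe the same quantity, which a few lines of code can confirm. This is a minimal sketch; the function names are hypothetical, and the closed form uses the standard identity $\int_0^1 \theta^{b-1/2} (1-\theta)^{a-1/2}\, d\theta = \Gamma(a + \tfrac{1}{2})\Gamma(b + \tfrac{1}{2}) / \Gamma(a + b + 1)$ for the Beta function.

```python
import math


def kt_sequential(bits):
    """KT(x_{1:n}) as a product of the chain-rule predictions."""
    prob, a, b = 1.0, 0, 0                 # a = zeros seen, b = ones seen
    for x in bits:
        p_one = (b + 0.5) / (a + b + 1)
        prob *= p_one if x == 1 else 1.0 - p_one
        a, b = a + (x == 0), b + (x == 1)
    return prob


def kt_closed_form(bits):
    """The same quantity from the Beta(1/2, 1/2) mixture integral."""
    a, b = bits.count(0), bits.count(1)
    log_prob = (math.lgamma(a + 0.5) + math.lgamma(b + 0.5)
                - math.lgamma(a + b + 1) - math.log(math.pi))
    return math.exp(log_prob)


def kt_redundancy_bits(bits, theta):
    """log2 theta^b (1 - theta)^a - log2 KT(x_{1:n}), as in Equation 4."""
    a, b = bits.count(0), bits.count(1)
    log_lik = 0.0
    if b:                                  # skip zero counts to avoid log(0)
        log_lik += b * math.log2(theta)
    if a:
        log_lik += a * math.log2(1.0 - theta)
    return log_lik - math.log2(kt_sequential(bits))


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 1, 1, 1]
    theta_ml = data.count(1) / len(data)   # maximises the left side of Eq. 4
    print(kt_redundancy_bits(data, theta_ml), "<=", 0.5 * math.log2(len(data)) + 1)
```

Evaluating the redundancy at the maximum likelihood $\theta$ gives the worst case over $\theta$ for a fixed sequence, so it is the natural point at which to test Equation 4.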

We now analyze the performance of the KT estimator when used in combination with PTW. Our 
next result follows immediately from Theorem 1 and Equation 4. 

Corollary 1. For all $n \in \mathbb{N}$, for all $x_{1:n} \in \{0, 1\}^n$, if $\mu$ is a piecewise stationary source
segmented according to any partition $\mathcal{P} \in \mathcal{T}_n$, with the data in segment $i$ being generated by
i.i.d. Bernoulli($\theta_i$) trials with $\theta_i \in [0, 1]$ for $1 \le i \le |\mathcal{P}|$, the redundancy of the
PTW-KT algorithm, obtained by setting $d = \lceil \log n \rceil$ and using the KT estimator as a base model, is
upper bounded by

$$\Gamma_d(\mathcal{P}') + \frac{|\mathcal{P}|}{2} \log\!\left( \left\lceil \frac{n}{|\mathcal{P}| (\lceil \log n \rceil + 1)} \right\rceil \right) (\lceil \log n \rceil + 1) + |\mathcal{P}| (\lceil \log n \rceil + 1).$$

Corollary 1 shows that the redundancy behavior of PTW-KT is $O(|\mathcal{P}| (\log n)^2)$. Thus we expect
this technique to perform well if the number of stationary segments is small relative to the length
of the data. This bound also has the same asymptotic order as previous [20, 10, 9] low complexity
techniques.

Next we present some experiments with PTW-KT on synthetic, piecewise stationary data. We
compare against techniques from information theory and online learning, including (i) Live and Die
Coding (LAD) [20], (ii) the variable complexity, exponential weighted averaging (VCW) method
of [9], and (iii) the DEC-KT estimator [14, 17], a heuristic variant of the KT estimator that
exponentially decays the $a$, $b$ counts to better handle non-stationary sources. Figure 3 illustrates the
redundancy of each method as the number of change points increases. To mitigate the effects of
any particular choice of change points or $\theta_i$, we report results averaged over 50 runs, with each
run using a uniformly random set of change point locations and $\theta_i$. 95% confidence intervals are
shown on the graphs. PTW performs noticeably better than all other methods when the number of
segments is small. When the number of segments gets high, both PTW and VCW outperform all other
techniques, with VCW being slightly better once the number of change points is sufficiently large.
The VCW($g$) technique interpolates between a weighting scheme very close to the linear method
in [21] and LAD; however, the complexity of this algorithm when applied to the KT estimator is
$O(g n \log n)$, so with $g = 5$, the runtime is already 5 times larger than that of PTW.
Figure 3: Average redundancy of various estimators on binary data for increasing number of change points.
Panels: (a) n = 8192 over splits {0, 1, 2, ..., 20}; (b) n = 65536 over splits {0, 1, 2, ..., 40}.
Horizontal axis: Number of Change Points.

Table 1: Performance (average bits per byte) on the Calgary Corpus



Additionally, we evaluated the same set of techniques as a replacement to the KT estimator 
for the memoryless model used within Context Tree Switching (CTS) [17], a recently introduced 
universal data compression algorithm for binary, stationary Markov sources of bounded memory. 
Performance was measured on the well known Calgary Corpus [3]. Each result was generated 
using CTS with a context depth of 48 bits. The results (in average bits per byte) are shown in 
Table 1. Here we see that PTW-KT consistently matches or outperforms the other methods. The
largest relative improvements are seen on the non-text files, GEO, OBJ1, OBJ2 and PIC. While the
performance of VCW could be improved by using a g > 5, it was already considerably slower than 
the other methods. 

Tracking. PTW can also be used to derive an alternate algorithm for tracking [11] using the
code-length loss. Consider a base model $p$ that is a convex combination of a finite set
$\mathcal{M} := \{\nu_1, \nu_2, \ldots, \nu_{|\mathcal{M}|}\}$ of $k$-bounded memory models, that is,

$$p(x_{1:n} \mid x_{1-k:0}) := \sum_{\nu \in \mathcal{M}} w_\nu\, \nu(x_{1:n} \mid x_{1-k:0}), \tag{5}$$

where $x_{1-k:0} \in \mathcal{X}^k$ denotes the initial (possibly empty) context, each $\nu$ is a $k$-bounded memory
probabilistic data generating source (that is, $\nu(x_t \mid x_{<t}) = \nu(x_t \mid x_{t-k:t-1})$ for any $t$),
$w_\nu \in \mathbb{R}$ and $w_\nu > 0$ for all $\nu \in \mathcal{M}$, and $\sum_{\nu \in \mathcal{M}} w_\nu = 1$. We now show
that applying PTW to $p$ gives rise to a model that will perform well with respect to an interesting subset
of the class of switching models. A switching model is composed of two parts, a set of models $\mathcal{M}$
and an index set. An index set $i_{1:n}$ with respect to $\mathcal{M}$ is an element of
$\{1, 2, \ldots, |\mathcal{M}|\}^n$. Furthermore, an index set $i_{1:n}$ can be
naturally mapped to a temporal partition in $\mathcal{T}_n$ by processing the index set sequentially, adding a
new segment whenever $i_t \ne i_{t+1}$ for $1 \le t < n$. For example, if $|\mathcal{M}| \ge 2$, the string 1111122221
maps to the temporal partition $\{(1, 5), (6, 9), (10, 10)\}$. The partition induced by this mapping will
be denoted by $S(i_{1:n})$. A switching model can then be defined as

$$\xi_{i_{1:n}}(x_{1:n}) := \prod_{(a,b) \in S(i_{1:n})} \nu_{i_a}(x_{a:b} \mid x_{a-k:a-1}),$$

where we have adopted the convention that the previous symbols at each segment boundary define
the initializing context for the next bounded memory source.² The set of all possible switching
models for a sequence of length $n$ with respect to the model class $\mathcal{M}$ will be denoted by
$\mathcal{I}_n(\mathcal{M})$. If we now let $r(x_{1:n})$ denote $\mathrm{PTW}_{\lceil \log n \rceil}(x_{1:n})$ using a base
model as defined by Equation 5, we can state the following upper bound on the redundancy of $r$ with
respect to an arbitrary switching model.
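The mapping from an index set to its induced temporal partition, described in the paragraph above, is a single pass over the sequence. The following sketch shows it on the example string from the text; `induced_partition` is a hypothetical name, and segments use the 1-based inclusive convention of the paper.

```python
def induced_partition(index_seq):
    """Return S(i_{1:n}) as a list of 1-based inclusive (start, end) segments."""
    segments, start = [], 1
    n = len(index_seq)
    for t in range(1, n + 1):
        # Close the current segment at the end of the sequence or whenever
        # the model index changes between time t and time t + 1.
        if t == n or index_seq[t - 1] != index_seq[t]:
            segments.append((start, t))
            start = t + 1
    return segments


if __name__ == "__main__":
    print(induced_partition([int(c) for c in "1111122221"]))
```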

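Corollary 2. For all $n \in \mathbb{N}$, for any $x_{1:n} \in \mathcal{X}^n$ and for any switching model
$\xi_{i_{1:n}} \in \mathcal{I}_n(\mathcal{M})$, we have

$$-\log r(x_{1:n}) + \log \xi_{i_{1:n}}(x_{1:n}) \le (2 + \kappa)\, |S(i_{1:n})|\, (\lceil \log n \rceil + 1), \tag{6}$$

where $\kappa := \max_{\nu \in \mathcal{M}} -\log(w_\nu)$.

Proof. Using $-\log p(x_{1:t} \mid x_{1-k:0}) = -\log \{ \sum_{\nu \in \mathcal{M}} w_\nu\, \nu(x_{1:t} \mid x_{1-k:0}) \} \le -\log w_{\nu^*} - \log \nu^*(x_{1:t} \mid x_{1-k:0})$
for any $\nu^* \in \mathcal{M}$ and $t \in \mathbb{N}$, we see that the redundancy of $p$ with respect
to any single model in $\mathcal{M}$ is bounded by $\kappa := \max_{\nu \in \mathcal{M}} -\log(w_\nu)$. Combining
this with Theorem 1 completes the proof. □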

Inspecting Corollary 2, we see that there is a linear dependence on the number of change points 
and a logarithmic dependence on the sequence length. Thus we can expect our tracking technique 
to perform well provided the data generating source can be well modeled by some switching model 
that changes infrequently. The main difference between our method and [11] is that our prior 
depends on additional structure within the index sequence. While both methods have a strong 
prior bias towards favoring a smaller number of change points, the PTW prior arguably does a
better job of ensuring that the change points are not clustered too tightly together. This benefit 
does however require logarithmically more time and space. 

5 Extensions and Future Work 

Along the lines of [9], our method could be extended to a more general class of loss functions
within an online learning framework. We also remark that a technique similar to Context Tree
Maximizing [18] can be applied to PTW to extract the best, in terms of Minimum Description
Length [8], binary temporal partition for a given sequence of data. Unfortunately we could not
find a way to avoid building a full context tree for this case, which means that $O(n \log n)$ memory
would be required instead of the $O(\log n)$ required by Algorithm 1. Another interesting follow up
would be to generalize Theorem 3 in [17] to the piecewise stationary setting. Combining such a
result with Corollary 1 would allow us to derive a redundancy bound for the algorithm that uses
PTW-KT for the memoryless model within CTS; this bound would hold with respect to any binary,
piecewise stationary Markov source of bounded memory.

² This is a choice of convenience. One could always relax this assumption and naively encode the first $k$ symbols of
any segment using a uniform probability model, incurring a startup cost of $k \log |\mathcal{X}|$ bits before applying the relevant
bounded memory model. This would increase the upper bound in Equation 6 by $|S(i_{1:n})| (\lceil \log n \rceil + 1)\, k \log |\mathcal{X}|$.

6 Conclusion 

This paper has introduced Partition Tree Weighting, an efficient meta-algorithm that automatically
generalizes existing coding distributions to their piecewise stationary extensions. Our main
contribution is to introduce a prior, closely related to the Context Tree Weighting method, to
efficiently weight over a large subset of possible temporal partitions. The order of the redundancy and
the complexity of our algorithm match those of the best competitors available in the literature,
with the new algorithm exhibiting a superior complexity-performance trade-off in our experiments.

Acknowledgments. The authors would like to thank Marcus Hutter for some helpful comments. 
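Appendix A. Supplementary Proofs

Lemma 1. For any depth $d \in \mathbb{N}$, given a sequence of data $x_{1:n} \in \mathcal{X}^n$ satisfying $n \le 2^d$,

$$\mathrm{PTW}_d(x_{1:n}) = \tfrac{1}{2}\, p(x_{1:n}) + \tfrac{1}{2}\, \mathrm{PTW}_{d-1}(x_{1:k})\, \mathrm{PTW}_{d-1}(x_{k+1:n}), \tag{2}$$

where $k = 2^{d-1}$.

Proof. This is a straightforward adaptation of Lemma 2 from [22]. We use induction on $d$. First
note that $\mathrm{PTW}_0(x_{1:n}) = p(x_{1:n})$ by definition, so the base case of $d = 1$ holds trivially. Now assume
that Equation 2 holds for some depth $d - 1$, and observe that

$$\begin{aligned}
\mathrm{PTW}_d(x_{1:n})
&= \sum_{\mathcal{P} \in \mathcal{C}_d(1)} 2^{-\Gamma_d(\mathcal{P})} \prod_{(i,j) \in \mathcal{P}} p(x_{i:j}) \\
&= \tfrac{1}{2}\, p(x_{1:n}) + \sum_{\mathcal{P} \in \mathcal{C}_d(1) \setminus \{\{(1, 2^d)\}\}} 2^{-\Gamma_d(\mathcal{P})} \prod_{(i,j) \in \mathcal{P}} p(x_{i:j}) \\
&= \tfrac{1}{2}\, p(x_{1:n}) + \sum_{\mathcal{P}_1 \in \mathcal{C}_{d-1}(1)} \sum_{\mathcal{P}_2 \in \mathcal{C}_{d-1}(k+1)} 2^{-\Gamma_d(\mathcal{P}_1 \cup \mathcal{P}_2)} \prod_{(i,j) \in \mathcal{P}_1} p(x_{i:j}) \prod_{(r,s) \in \mathcal{P}_2} p(x_{r:s}) \\
&= \tfrac{1}{2}\, p(x_{1:n}) + \tfrac{1}{2} \Bigg( \sum_{\mathcal{P}_1 \in \mathcal{C}_{d-1}(1)} 2^{-\Gamma_{d-1}(\mathcal{P}_1)} \prod_{(i,j) \in \mathcal{P}_1} p(x_{i:j}) \Bigg) \Bigg( \sum_{\mathcal{P}_2 \in \mathcal{C}_{d-1}(k+1)} 2^{-\Gamma_{d-1}(\mathcal{P}_2)} \prod_{(r,s) \in \mathcal{P}_2} p(x_{r:s}) \Bigg) \\
&= \tfrac{1}{2}\, p(x_{1:n}) + \tfrac{1}{2}\, \mathrm{PTW}_{d-1}(x_{1:k})\, \mathrm{PTW}_{d-1}(x_{k+1:n}),
\end{aligned}$$

where $k := 2^{d-1}$. The third step uses the property
$\Gamma_d(\mathcal{P}_1 \cup \mathcal{P}_2) = \Gamma_{d-1}(\mathcal{P}_1) + \Gamma_{d-1}(\mathcal{P}_2) + 1$,
which holds when $\mathcal{P}_1 \in \mathcal{C}_{d-1}(1)$ and $\mathcal{P}_2 \in \mathcal{C}_{d-1}(k + 1)$. The final step
applies the inductive hypothesis. □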

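Theorem 2. For all $n \in \mathbb{N}$, for all $x_{1:n} \in \mathcal{X}^n$, we have that

$$-\log \mathrm{PTW}(x_{1:n}) \le -\log \mathrm{PTW}_d(x_{1:n}) + \lceil \log n \rceil (\log 3 - 1),$$

where $d = \lceil \log n \rceil$.

Proof. We begin by showing that for any $t \le 2^k$ we have

$$\tfrac{2}{3}\, \mathrm{PTW}_{k+1}(x_{1:t}) \le \mathrm{PTW}_k(x_{1:t}). \tag{7}$$

Using Lemma 1,

$$\mathrm{PTW}_{k+1}(x_{1:t}) = \tfrac{1}{2}\, p(x_{1:t}) + \tfrac{1}{2}\, \mathrm{PTW}_k(x_{1:2^k})\, \mathrm{PTW}_k(x_{2^k+1:2^{k+1}}) = \tfrac{1}{2}\, p(x_{1:t}) + \tfrac{1}{2}\, \mathrm{PTW}_k(x_{1:t}) \tag{8}$$

holds for $t \le 2^k$. On the other hand,

$$\mathrm{PTW}_k(x_{1:t}) = \sum_{\mathcal{P} \in \mathcal{C}_k} 2^{-\Gamma_k(\mathcal{P})} \prod_{(a,b) \in \mathcal{P}} p(x_{a:b}) = \tfrac{1}{2}\, p(x_{1:t}) + \sum_{\mathcal{P} \in \mathcal{C}_k \setminus \{\{(1, 2^k)\}\}} 2^{-\Gamma_k(\mathcal{P})} \prod_{(a,b) \in \mathcal{P}} p(x_{a:b}) \ge \tfrac{1}{2}\, p(x_{1:t}),$$

which, together with Equation 8 proves Equation 7. The latter allows us to derive a lower bound
on $\mathrm{PTW}(x_{1:n})$, by noting that

$$\begin{aligned}
\mathrm{PTW}(x_{1:n})
&= \prod_{i=1}^{n} \mathrm{PTW}_{\lceil \log i \rceil}(x_i \mid x_{<i}) \\
&= \mathrm{PTW}_0(x_1) \Bigg( \prod_{a=1}^{d-1} \mathrm{PTW}_a(x_{2^{a-1}+1:2^a} \mid x_{1:2^{a-1}}) \Bigg) \mathrm{PTW}_d(x_{2^{d-1}+1:n} \mid x_{1:2^{d-1}}) \\
&= \mathrm{PTW}_0(x_1) \Bigg( \prod_{a=1}^{d-1} \frac{\mathrm{PTW}_a(x_{1:2^a})}{\mathrm{PTW}_a(x_{1:2^{a-1}})} \Bigg) \frac{\mathrm{PTW}_d(x_{1:n})}{\mathrm{PTW}_d(x_{1:2^{d-1}})} \\
&= \mathrm{PTW}_d(x_{1:n}) \prod_{a=1}^{d} \frac{\mathrm{PTW}_{a-1}(x_{1:2^{a-1}})}{\mathrm{PTW}_a(x_{1:2^{a-1}})} \\
&\ge \left( \tfrac{2}{3} \right)^{d} \mathrm{PTW}_d(x_{1:n}).
\end{aligned}$$

Hence, taking the negative logarithm of both sides we obtain

$$-\log \mathrm{PTW}(x_{1:n}) \le -\log \mathrm{PTW}_d(x_{1:n}) + d \log \tfrac{3}{2} = -\log \mathrm{PTW}_d(x_{1:n}) + \lceil \log n \rceil (\log 3 - 1). \qquad \square$$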